There are two ways to use Metatab data package resources in Pandas. One is to use the CSV files directly, which is easy to do if the package is published to a repository. However, it is better to use the Metatab module to load the package metadata and create dataframes.
The simplest way to use a file in a Metatab package is to load its CSV file directly. You can get the CSV file URL from the data repository page, such as the page for the ADOD Prevalence data in the San Diego Elder Dementia dataset.
While this is simple and portable, it does not give you the features of Metatab, such as built-in schema documentation.
In [3]:
import pandas as pd
df = pd.read_csv('http://s3.amazonaws.com/library.metatab.org/sandiegocounty.gov-adod-2012-sra-3/data/adod-prevalence.csv')
df.head()
Out[3]:
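Because a plain CSV carries no schema documentation, the column types have to be inferred from the data itself. The sketch below shows one way to summarize what Pandas inferred. It uses a small made-up CSV in memory rather than the ADOD file, so the column names and values are assumptions for illustration only.

```python
import io

import pandas as pd

# Made-up stand-in for a downloaded CSV; the column names and values are
# illustrative only and are not taken from the ADOD data.
sample_csv = io.StringIO(
    "region,year,cases\n"
    "1,2012,120\n"
    "2,2012,85\n"
)

df = pd.read_csv(sample_csv)

# Without package metadata, the schema is whatever Pandas infers.
schema = pd.DataFrame({
    "dtype": df.dtypes.astype(str),  # type Pandas inferred for each column
    "non_null": df.count(),          # number of non-missing values per column
})
print(schema)
```

The same summary can be built for a dataframe loaded from a real URL, but it only tells you the inferred types, not what each column means.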
The second way to access a package is to use the metatab package. This method requires installing the metatab Python package, but it has some important advantages, notably direct access to package and dataset documentation. You can load any type of Metatab package with the open_package() function, but for the highest performance, you should use the CSV package. Opening a CSV package loads only the metadata and the resources you need, while using a ZIP or Excel package requires downloading the entire package first.
To find the CSV package in a package that is published to a CKAN repository, look for a CSV file with the description "CSV Package Metadata in Metatab format". For the ADOD package, this file is named sandiegocounty.gov-adod-2012-sra-3.csv.
Opening the package returns a Metatab document object. If you display it in Jupyter, the output cell will display the package documentation.
In [7]:
import metatab
doc = metatab.open_package('http://s3.amazonaws.com/library.metatab.org/sandiegocounty.gov-adod-2012-sra-3.csv')
doc
Out[7]:
The .resource() method returns the resource with the given name. Displaying it shows the resource documentation.
In [4]:
r = doc.resource('adod-prevalence')
r
Out[4]:
Once you have a resource, use the .dataframe() method to get a Pandas dataframe.
In [6]:
df = r.dataframe()
df.head()
Out[6]:
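The three steps above (open the package, pick a resource, build a dataframe) can be collected into one small helper. This is only a sketch: the name load_resource_dataframe and its open_package parameter are hypothetical and not part of Metatab. The only Metatab calls it relies on are the ones shown in this notebook.

```python
def load_resource_dataframe(package_url, resource_name, open_package=None):
    """Open a Metatab package and return one resource as a Pandas dataframe.

    Hypothetical convenience wrapper. `open_package` can be replaced with
    another callable, for example a stub in tests; by default it is
    metatab.open_package.
    """
    if open_package is None:
        # Imported lazily so the helper can be defined without metatab installed.
        import metatab
        open_package = metatab.open_package

    doc = open_package(package_url)         # loads the package metadata
    resource = doc.resource(resource_name)  # looks up the resource by name
    return resource.dataframe()             # fetches the data as a dataframe


# Example call, using the package URL and resource name from this notebook:
# df = load_resource_dataframe(
#     'http://s3.amazonaws.com/library.metatab.org/sandiegocounty.gov-adod-2012-sra-3.csv',
#     'adod-prevalence')
```

Keeping the opener as a parameter is a design choice for testing, since it lets the helper run without network access.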